This section offers an in-depth exploration of the structure, statistics, and spread of the data found in our "member model data" and "city model data" tables. It presents a comprehensive snapshot of the data's features and subtleties.
# Dependencies and Setup
import package as pk
Package: Resources loaded. ☑
# Instantiate the 'DataToSpark' class to handle KKBOX data
kkbox = pk.DataToSpark("customer-churn-391917", "kkbox")
# Load specific tables into DataFrames
tables_to_load = ["member_model_main","city_model_main"]
kkbox.load_tables(tables_to_load)
# Retrieve the stored DataFrames
table = kkbox.get_tables()
# Create an instance of the class with models
member_model_df = table["member_model_main"]
city_model_df = table["city_model_main"]
# Call the 'bar_chart' function from plots.py to create a churn bar chart
pk.bar_chart([member_model_df, city_model_df], 'is_churn')
In our dataset, churn represents whether users discontinue their subscription or service. Upon analyzing the member_model and city_model data, we found a pronounced trend towards the "No" churn category: 92.07% in member_model and 95.64% in city_model. This indicates a strong retention rate for the users in both categories.
# Use the 'bar_scatter_chart' from plots.py to view city data.
pk.bar_scatter_chart(member_model_df, 'city')
The member dataset shows that 51.92% of the entries are from a particular city, referred to as 'city 1'. Following this, 'city 13' and 'city 5' have the highest frequencies. Furthermore, there's no observable correlation between the city of origin and the churn rate.
Based on Age Category:
# Using 'bucketized' from func.py and 'bar_scatter_chart' from plots.py, categorize and display the 'bd' column.
# member_model_df
member_bd_df=pk.bucketized(member_model_df, 'bd', 'age_group', [1, 13, 20, 35, 65])
pk.bar_scatter_chart(member_bd_df, 'age_group','member_model_df')
# city_model_df
city_bd_df=pk.bucketized(city_model_df, 'bd', 'age_group', [1, 13, 20, 35, 65])
pk.bar_scatter_chart(city_bd_df, 'age_group', 'city_model_df')
Both datasets exhibit an unclear category, with the city dataset displaying the highest frequency in this category. Notably, the age group categorized as Adolescents (13-19 years, labeled as Category 2) showcases the highest churn percentage. This group's churn rate lies outside the typical statistical range for age in these datasets.
Based on Gender Category:
# Use the 'bar_scatter_chart' from plots.py to view gender data.
pk.bar_scatter_chart(member_model_df, 'gender')
pk.bar_scatter_chart(city_model_df, 'gender')
Both datasets exhibit an unclear category, with the city dataset displaying the highest frequency in this category. Notably, the male category showcases the highest churn percentage.
# Use the 'bar_scatter_chart' from plots.py to view registration via data.
pk.bar_scatter_chart(member_model_df, 'registered_via')
pk.bar_scatter_chart(city_model_df, 'registered_via')
The most common registration method in the member model dataframe is "9", while in the city model dataframe, it's "7". However, for both dataframes, "registered via 4" shows the highest churn rate. Notably, this churn rate is statistically significant in the member model dataframe.
def generate_charts(model_df: pk.DataFrame, chart_title: str) -> None:
"""
Generate bar scatter and sub_line charts based on the provided DataFrame.
The function generates a bar scatter chart for registration years less than 2017,
and a sub_line chart for registration months within the years 2013-2016.
Args:
model_df (DataFrame): Input DataFrame with 'registration_init_time' column.
chart_title (str): Title of the chart.
Returns:
None
"""
# Extract the year from 'registration_init_time' column
date_df = model_df.withColumn("registration_year", pk.year(pk.col("registration_init_time")))
# Filter data to exclude records from the year 2017 and onwards
date_df_filter = date_df.filter(pk.col("registration_year") < 2017)
# Generate a bar scatter chart based on registration year
year_chart = pk.bar_scatter_chart(date_df_filter, 'registration_year', chart_title)
# Filter data to only include records from the last 4 years (2013-2016)
last_4_df = date_df.filter((pk.col("registration_year") >= 2013) & (pk.col("registration_year") < 2017))
# Extract month and day from 'registration_init_time'
last_4_df = last_4_df.withColumn("registration_month", pk.month("registration_init_time"))
last_4_df = last_4_df.withColumn("registration_day", pk.dayofmonth("registration_init_time"))
# Generate a sub_line chart based on registration month within the years 2013-2016
month_chart = pk.sub_line(last_4_df, ["registration_year", "registration_month"], [2013, 2014, 2015, 2016])
# member_model_df
generate_charts(member_model_df, "member_model_df")
23/09/28 03:55:24 WARN org.apache.spark.util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
# city_model_df
generate_charts(city_model_df, "city_model_df")
In both datasets, the annual registration frequency has increased. However, a more pronounced rise in churn rate is only evident in the member dataframe. Breaking down the churn percentages for the last four months reveals an average increase in churn for the member model dataframe, a trend not seen in the city model dataframe. Additionally, the city model dataframe exhibits more significant fluctuations in churn.
# Use the 'bar_scatter_chart' from plots.py to view Payment Method ID data.
pk.bar_scatter_chart(member_model_df, 'payment_method_id', None, 0.4)
pk.bar_scatter_chart(city_model_df, 'payment_method_id', None, 0.4)
Observations suggest that as the frequency of payment method ID increases, the percentage of churn decreases, indicating a strong inverse correlation.
Based on Subscription Category:
# Using 'bucketized' from func.py and 'bar_scatter_chart' from plots.py, categorize and display the 'Payment Plan Days' column.
# member_model_df
member_plan_df=pk.bucketized(member_model_df, 'payment_plan_days', 'plan_group', [31, 91, 181, 366])
pk.bar_scatter_chart(member_plan_df, 'plan_group', "member_model_df", 1)
# city_model_df
city_plan_df=pk.bucketized(city_model_df, 'payment_plan_days', 'plan_group', [31, 91, 181, 366])
pk.bar_scatter_chart(city_plan_df, 'plan_group', "member_model_df", 1)
Many customers opt for a monthly subscription.
Based on Price Category:
def plan_price(df: pk.DataFrame, params: pk.List[str] = []) -> None:
"""
Plots bar scatter charts for plan prices.
Args:
- df (DataFrame): The input dataframe.
- params (list of str): List of column names to be bucketized.
Returns:
None
"""
chart_title = pk.PltAssets(df).get_variable_name()
for param in params:
# Extract base name from the param for constructing column names
base_name = param.split('_')[0]
price_group_col = f"{base_name}_price_group"
# Bucketize the data based on given thresholds
bucketized_df = pk.bucketized(df, param, price_group_col, [1, 51, 101, 201])
# Plot the bar scatter chart
pk.bar_scatter_chart(bucketized_df, price_group_col, chart_title)
# member_model_df
plan_price(member_model_df, params=['plan_list_price','actual_amount_paid'])
# city_model_df
plan_price(city_model_df, params=['plan_list_price','actual_amount_paid'])
Customers opt for 51 to 100 (Category 2) and 101 to 200 (Category 3) price.
# member_model_df
pk.multiline(member_model_df,['is_cancel', 'is_auto_renew', 'is_churn'], "Frequency of Cancellation and Auto Renewal by Churn")
# city_model_df
pk.multiline(city_model_df,['is_cancel', 'is_auto_renew', 'is_churn'], "Frequency of Cancellation and Auto Renewal by Churn")
78.43% of customers in member_model_df and 89.67% of those in city_model_df are considered safe for business, exhibiting no cancellations, no churn, and possessing auto-renewal features. The proportion of high-risk customers is 5.65% in member_model_df and 2.48% in city_model_df, characterized by the absence of cancellations and auto-renewals, and the presence of churn. Additionally, around 2% of customers have shown churn alongside cancellations and auto-renewals.
pk.bar_scatter_chart(member_model_df, 'activity_count')
pk.bar_scatter_chart(city_model_df, 'activity_count')
In the member_model_df, there is observed growth in customer frequency corresponding to the number of days of activity, indicating an increase in daily user activity. However, for the city_model_df, the activity pattern exhibits a U-shape, suggesting variations in user engagement over time. Despite these differences, both DataFrames, member_model_df and city_model_df, show a consistent trend of decreasing churn rates with increased user activity.
def sum_plot(dfs: pk.List[pk.DataFrame], param: str, bins: pk.List[pk.Union[int, float]] = [101, 1001, 100]) -> None:
"""
Plots bar scatter charts for plan prices, bucketized by specified bins.
Args:
dfs (List[pd.DataFrame]): A list of input DataFrames.
param (str): The column name in the DataFrame to be bucketized.
bins (List[Union[int, float]], optional): The bin edges for bucketizing the column.
Defaults to [101, 1001, 100].
Returns:
None
"""
for df in dfs:
chart_title = pk.PltAssets(df).get_variable_name()
# Construct column name for bucketized data
price_group_col = f"{param}_group"
# Bucketize the data based on the provided bins
bucketized_df = pk.bucketized(df, param, price_group_col, bins)
# Plot the bar scatter chart
pk.bar_scatter_chart(bucketized_df, price_group_col, chart_title)
sum_plot([member_model_df,city_model_df], 'sum_num_25')
sum_plot([member_model_df,city_model_df], 'sum_num_50')
sum_plot([member_model_df,city_model_df], 'sum_num_75')
sum_plot([member_model_df,city_model_df], 'sum_num_985')
sum_plot([member_model_df,city_model_df], 'sum_num_100', [1,1000,50])
sum_plot([member_model_df,city_model_df], 'sum_num_unq', [1,1000,50])
sum_plot([member_model_df,city_model_df], 'total_secs', [0,1000000,50000])